Introduction

The popularity of Citi Bike is definitely one reason that makes New York City (NYC) unique among other cities in United States. It is a bicycle sharing system opened in May 2013 in NYC with 332 stations. Until now, there are 750 stations and 12,000 bikes serving Manhattan, Queens and Brooklyn (also including New Jersey, but for analysis purpose of this project, we only investigate Citi Bike data in NYC). However, we usually see way more people using Citi Bike around Midtown than Columbia University (CU). Is this because the limited number of stations or available bikes around CU or people around CU do not prefer to use sharing bicycles? We are interested in the spatial distribution of stations and number of users. We will use R to mainly analyze Citi Bike data in one particular month in order to understand the transportation of sharing bike and other interesting questions: if weather and season have an impact on Citi Bike frequency; Is there any trip pattern per week.

Team members and corresponding contributions:

Description of data

Our data is from Motivate.
Motivate is the most experienced bike share team.They launched CitiBike in 2013 in NYC.
Also, they collect System data as CSV files, which can be downloaded from the link above.
In this final project, we analyze the data from June, 2018 because we think it is a common time period when people prefer to ride bikes under relatively mild weather, which can be very representative.
We do realize that season and time will have an impact on people’s choice on riding bikes, so in our analysis, we will include other data file such as weather information from other sources in order to analyze the time and seasons. Besides, the website provides data from June 2013 so we can analyze the time series pattern as well. When it comes to time series, we downloaded data from the source from 2017 June to 2018 June, and sampled each dataset with randomly choosing 5% of the data. Then we concat them together. The code combine the dataset is written in python and is also included in the github repo named sampledata.py.
The feature includes:

Analysis of data quality

Load Data

#df <- read.csv('./201806-citibike-tripdata.csv')
load("bike.RData")
df = data_18_6

Definition on outliers

We choose dataset of 201806 as mentioned in Part.2. The unit of trip duration in the dataframe is second which is not suitable for analyzing. Thus, we transform the unit into minutes by dividing 60. Since the data is bike.

df <- df %>% mutate(minute=tripduration/60)
overalltripduration <-  ggplot(data=df,mapping=aes(x=1,y=minute)) + geom_boxplot(fill='Orange')+ggtitle('Distribution of time duration')
overalltripduration + theme(plot.title = element_text(lineheight=.8, face="bold"))

From the above graph boxplot, we see that the outlier such as time below 0 and the outliers may cause trouble in our analysis. Here, we define that timeduration < 0 and time duration over 120 as outliers. In the above mentioned code, since the ratio of time duration over 120 is about 0.3%. We can confidently remove those data. Because normally one do not use a bike for that long.

df <- subset(df,minute<120 & minute>0) 
overalltripduration2 <-  ggplot(data=df,mapping=aes(x=1,y=minute)) + geom_boxplot(fill='Orange')+ggtitle('distribution of time duration') 
overalltripduration2 + theme(plot.title = element_text(lineheight=.8, face="bold")) 

overalltripduration3 <- ggplot(data=df,mapping=aes(minute)) + geom_histogram(fill='Orange',color='black')+ggtitle('distribution of time duration') 
overalltripduration3 + theme(plot.title = element_text(lineheight=.8, face="bold")) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There are two stations outside New York state (in Montreal, Canada). Below is the Citi Bike locations. Apprently those are outliers, so we remove them.

summary = df %>%
  group_by(start.station.id) %>%
  summarise(lat=as.numeric(start.station.latitude[1]),
            long=as.numeric(start.station.longitude[1]),
            name=start.station.name[1],
            n.trips=n())

leaflet(summary) %>%setView(-74.00, 42.71, zoom = 6) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircles(lng = ~long, lat = ~lat, col = "#fb6a4a") 
id = which(df$start.station.latitude > 45)
df = df[-id,]

Analysis of high frequency of birth year 1969

df$factor_year = as.factor(df$birth.year)
summary_18_6 = df %>% group_by(factor_year) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq)) 
summary_18_6 = mutate(summary_18_6, time = rep("201806", nrow(summary_18_6)))

ggplot(summary_18_6, aes(x = factor_year, y = Freq)) + geom_point(col = "#fb6a4a") +
  coord_flip() + xlab("user's birth year") + 
  ggtitle("Frequency of Citi Bike users' birth year in 2018/6")

One interesting observation is that the mode of user’s birth year is 1969 (49 years old now) in 2018 June. So it is worth to discuss what is going on in that group of users because the frequency is way more higher than any other birth year users. Thus, we looked at dataset from 2016/6, 2017/6, 2017/12, 2018/1 and 2018/5.

# data_16_6 = read.csv("./data/201606-citibike-tripdata.csv")
# data_17_6 = read.csv("./data/201706-citibike-tripdata.csv")
# data_17_12 = read.csv("./data/201712-citibike-tripdata.csv")
# data_18_1 = read.csv("./data/201801-citibike-tripdata.csv")
# data_18_5 = read.csv("./data/201805-citibike-tripdata.csv")
# data_18_6 = df
data_18_5 = data_18_5 %>% mutate(minute=tripduration/60)%>%
   filter(minute<120 & minute>0)
data_17_6 = data_17_6 %>% mutate(minute=tripduration/60)%>%
   filter(minute<120 & minute>0)
data_16_6 = data_16_6 %>% mutate(minute=tripduration/60)%>%
   filter(minute<120 & minute>0)
data_17_12 = data_17_12 %>% mutate(minute=tripduration/60)%>%
   filter(minute<120 & minute>0)
data_18_1 = data_18_1 %>% mutate(minute=tripduration/60)%>%
   filter(minute<120 & minute>0)
data_18_6$factor_year = as.factor(data_18_6$birth.year)

summary_18_6 = data_18_6 %>% group_by(factor_year) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq)) #1969
summary_18_6 = mutate(summary_18_6, time = rep("201806", nrow(summary_18_6)))


data_17_6$factor_year = as.factor(data_17_6$birth.year)
summary_17_6 = data_17_6 %>% group_by(factor_year) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq)) #NULL
summary_17_6 = mutate(summary_17_6, time = rep("201706", nrow(summary_17_6)))

data_16_6$factor_year = as.factor(data_16_6$birth.year)
summary_16_6 = data_16_6 %>% group_by(factor_year) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq)) #NULL
summary_16_6 = mutate(summary_16_6, time = rep("201606", nrow(summary_16_6)))

data_17_12$factor_year = as.factor(data_17_12$birth.year)
summary_17_12 = data_17_12 %>% group_by(factor_year) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq)) #NULL
summary_17_12 = mutate(summary_17_12, time = rep("201712", nrow(summary_17_12)))

data_18_1$factor_year = as.factor(data_18_1$birth.year)
summary_18_1 = data_18_1 %>% group_by(factor_year) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq)) #1969
summary_18_1 = mutate(summary_18_1, time = rep("201801", nrow(summary_18_1)))

data_18_5$factor_year = as.factor(data_18_5$birth.year)
summary_18_5 = data_18_5 %>% group_by(factor_year) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq)) #1969
summary_18_5 = mutate(summary_18_5, time = rep("201805", nrow(summary_18_5)))


table = rbind(summary_16_6[1:3,], summary_17_6[1:3,], summary_17_12[1:3,],
              summary_18_1[1:3,], summary_18_5[1:3,],summary_18_6[1:3,])

table$factor_year= as.character(table$factor_year)
table[1,1] = "NULL"
table$factor_year = factor(table$factor_year)
ggplot(table, aes(x = factor_year, y = Freq)) +
  geom_bar(stat='identity', fill = "#fb6a4a", alpha = 0.8) +
  facet_wrap(~factor(time)) +
  xlab("users' birth year") +
  ylab("frequency") +
  ggtitle("Top Three Users' Birth Year")

Time Top Three Birth Year Rank of birth year 1969
2016/6 NA(191,500), 1985(54,230), 1988(52,674) 24
2017/6 NA(188,803), 1989(65,898), 1988(65,124) 25
2017/12 NA(38,928), 1985(32,877), 1989(32,818) 25
2018/1 1969(39,540), 1989(26,741), 1990(26,724) 1
2018/5 1969(215,754), 1988(70,506), 1989(70,506) 1
2018/6 1969(218,208), 1988(79,053), 1989(76,515) 1

We looked at the registration page of Citi Bike. Now they do not allow users to leave birth date as blank when register. So it is resaonable to assume that NA birth year of data have been changed to 1969 since Jan 2018, which is the reason that frequency of 1969 in June 2018 is extremely high. We do not think replacing NA with 1969 is a good strategy to deal with NA data because 1969 is not an empty birth year. There are quite amount of users who born in 1969 (around 10%), it is then making our analysis of birth year difficult.

Overall, data quality of our dataset is pretty good. The outlier we defined, which is time duration over 120 minutes, only accounts for 0.3% data in our dataset. From this point of view, we can drop out those rows.

Main analysis (Exploratory Data Analysis)

1. Analysis on Age distribution of our customers

df_2month = full_join(summary_18_6, summary_17_6, by = "factor_year")
## Warning: Column `factor_year` joining factors with different levels,
## coercing to character vector
df_3month = full_join(df_2month, summary_16_6, by = "factor_year")
## Warning: Column `factor_year` joining character vector and factor, coercing
## into character vector
t1 = data.frame(factor_year = df_3month$factor_year,Freq = df_3month$Freq.x, time = df_3month$time.x)
t2 = data.frame(factor_year = df_3month$factor_year,Freq = df_3month$Freq.y, time = df_3month$time.y)
t3 = data.frame(factor_year = df_3month$factor_year,Freq = df_3month$Freq, time = df_3month$time)

tf = rbind(t1,t2,t3)

ggplot(tf, aes(x = factor_year, y = Freq, color = as.factor(time))) + geom_point() +
  coord_flip() + xlab("user's birth year") + 
  ggtitle("Frequency of Citi Bike users' birth year in different months")+
  labs(fill="Time")
## Warning: Removed 39 rows containing missing values (geom_point).

This Cleveland Dot Plot shows the distribution of user’s birth year in 2016/6, 2017/6 and 2018/6. They are all left-skewed, which is what we expect. It is glad to see number of users for all ages increases from 2016 to 2018. In 2018 June, the active users are aged between 24 and 37.

set.seed(321)
index = sample(nrow(df), round(nrow(df)*0.001))
sample = df[index,]
g = ggplot(sample, aes(birth.year,minute))
g + geom_point(alpha = 0.5, col = "#fb6a4a") + ylab("trip duration (in min)") +
  xlab("users' birth year") + ggtitle("Scatterplot of uses' birth year and trip duration (in sec) in Jun 2018")

There are 1.945,611 observations in our data set, so we sampled 1% of them to see the relationship between user’s birth of year and trip durations.
We were expect to see an increasing and then decreasing trend as in the distribution of users’ birth year (right-skewed). However, there is no obvious trend in the scatterplot. Again, since NA birth year were transfered to 1969, there are a lot observations in 1969 and its covers a wide range of trip durations, which is misleading.

2. Analysis on Gender of our customers

gender = data.frame(group = c("unkown", "male", "female"),
                    value = summary(as.factor(df$gender)))
pie = ggplot(gender, aes(x="", y = value, fill=group)) +
  geom_bar(width = 1, stat="identity") + 
  coord_polar("y", start = 0)

blank_theme <- theme_minimal()+
  theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  panel.border = element_blank(),
  panel.grid=element_blank(),
  axis.ticks = element_blank(),
  plot.title=element_text(size=14, face="bold")
  )

pie + scale_fill_manual(values=c("#fbb4ae", "#b3cde3", "#ccebc5")) + 
  blank_theme + theme(axis.text.x=element_blank()) +
  geom_text(aes(y = value/3 + c(0, cumsum(value)[-length(value)]), 
                label = c("10%","66%", "24%")), size=5) + 
  ggtitle("Ratio of users' gender of Citi Bike ")

There are 66% male users, 24% female users and 10% unkown. Apprently, there is a big differnece betwee female users and male users. We have seen some news report that Citi Bike users sometimes complain about the weight of citibike. Maybe that is one of the reasons that female users are small. Thus, we can think of some strategies for female users to increase profits because there is indeed room for improvement.

3. Analysis on Station Distribution and Corresponding Number of Bikes

Below is the heatmap of number of trips for each station. We also label the top 10 popular stations

## get station info 
station.info <- data_18_6 %>%
  group_by(start.station.id) %>% 
  summarise(lat=as.numeric(start.station.latitude[1]),
            long=as.numeric(start.station.longitude[1]),
            name=start.station.name[1],
            n.trips=n())

pal = colorNumeric(
  palette="YlOrRd",
  domain = station.info$n.trips
)

top10 = station.info %>% dplyr::arrange(desc(n.trips))


icon <- makeIcon(
  iconUrl = "citi_bike_icon.png",
  iconWidth = 20, iconHeight = 20,
  iconAnchorX = 22, iconAnchorY = 22,
  shadowWidth = 20, shadowHeight = 20,
  shadowAnchorX = 4, shadowAnchorY = 62
)

leaflet(station.info) %>%setView(-74.00, 40.71, zoom = 12) %>%
  leaflet::addProviderTiles("CartoDB.Positron") %>%
  leaflet::addCircles(lng = ~long, lat = ~lat, color =~pal(n.trips)) %>%
  leaflet::addLegend("bottomleft", pal = pal, value = ~n.trips, title = "Num of trips starting from this station") %>%
  addMarkers(lng = ~top10[1:10,]$long, lat = ~top10[1:10,]$lat,  icon = icon)

We can see that there are more users starting their trips in Manhattan. In Manhattan, Midtown is the most popular area. The station near Central Park is also a popular spot, probably because there are some tourists riding inside Central Park. Back to our first question in the introduction, we are wondering why we do not see a lot of people around campus riding with Citi Bike. The number of stations around campus is not limited, but the starting rides from those stops are small. One possible explanation we believe is that there are quite amount of students living in dorms or very close to campus. So they do not need bicycles to transport.

4. Analysis on weather and season

Join weather data to see if weather has an impact

Is it reasonable to assume that weather and season has a huge impact on CitiBike frequency.

#NYCweather<-read.csv("./data/NYCweather.csv")
NYCweather<-read.csv("NYCweather.csv")
df$Date <- as.Date(df$starttime)
NYCweather$Date<-as.Date(NYCweather$Date)
df_weather <- inner_join(df,NYCweather,by="Date")
df_weather$Severe <- as.factor(df_weather$Severe)
hist(df_weather$Date, breaks="days", freq=TRUE)

ggplot(df_weather, aes(x=Date))+
  geom_histogram(colour="white", fill="steelblue")+
  ggtitle("Number of trips by day")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(df_weather, aes(x=Severe, y=))+
  geom_bar(colour="white", fill="steelblue")+
  ggtitle("Number of trips per day for severe and unsevere weather condition")

ggplot(data = df_weather,
       aes(x = Windspeed, y = minute)) +
  scale_x_continuous("Windspeed") +
  scale_y_continuous("Average trip duration in minutes") +
  ggtitle("Trip duration vs. Windspeed") + 
  stat_summary(fun.y="mean", geom = "line", colour="steelblue", size=1) 

ggplot(data = df_weather,
       aes(x = Temperature, y = minute)) +
  scale_x_continuous("Temperature") +
  scale_y_continuous("Average trip duration in minutes") +
  ggtitle("Trip duration vs. Temperature") + 
  stat_summary(fun.y="mean", geom="line", colour="steelblue", size=1) 

ggplot(data = df_weather,
       aes(x = Precipitation, y = minute)) +
  scale_x_continuous("Precipitation") +
  scale_y_continuous("Average trip duration in minutes") +
  ggtitle("Trip duration vs. Precipitation") +
  stat_summary(fun.y="mean", geom = "line", colour="steelblue", size=1)

ggplot(data = df_weather,
       aes(x=factor(Severe), y=minute))+
  xlab("Severity of weather")+ylab("Average trip length")+
  stat_summary(fun.y="mean", geom="bar")

df_by_temp <- df_weather %>% group_by(Temperature) %>% summarise(n_distinct(starttime))
ggplot(data = df_by_temp, aes(x=as.numeric(Temperature), y=`n_distinct(starttime)`))+
  stat_summary(geom="bar", fill="steelblue")+
  xlab("Temperature")+
  ylab("Average trips per day")+
  ggtitle("Average daily trips in different temperatures")
## No summary function supplied, defaulting to `mean_se()

df_by_windspeed <- df_weather %>% group_by(Windspeed) %>% summarise(n_distinct(starttime))
ggplot(data = df_by_windspeed, aes(x=as.numeric(Windspeed), y=`n_distinct(starttime)`))+
  stat_summary(geom="bar", fill="steelblue")+
  xlab("Windspeed")+
  ylab("Average trips per day")+
  ggtitle("Average daily trips in different windspeeds")
## No summary function supplied, defaulting to `mean_se()

df_by_precip <- df_weather %>% group_by(Precipitation) %>% summarise(n_distinct(starttime))
ggplot(data = df_by_precip, aes(x=as.numeric(Precipitation), y=`n_distinct(starttime)`))+
  stat_summary(geom="bar", fill="steelblue")+
  xlab("Precipitation")+
  ylab("Average trips per day")+
  ggtitle("Average daily trips in different precipitations")
## No summary function supplied, defaulting to `mean_se()

Interactive time series to see if season has an impact on customer flow with Dygraph

We combine over a year time series to see if season has an impact on customer flow with Dygraph. The data is from 201706 to 201806.

library(dygraphs)
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following object is masked from 'package:leaflet':
## 
##     addLegend
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:memisc':
## 
##     is.interval
## The following object is masked from 'package:base':
## 
##     date
library(zoo)
#df_timeseries <- read.csv('./data/data_2017_2018.csv')
df_timeseries <- read.csv('data_2017_2018.csv')
a <- lubridate::mdy_hms(df_timeseries$starttime)
## Warning: All formats failed to parse. No formats found.
b <- as.Date(df_timeseries$starttime)
b[is.na(b)] <- a[!is.na(a)]
df_timeseries$starttime <- b
df_timeseries <- df_timeseries %>% mutate(time = format(as.Date(starttime), "%Y-%m"))
df_timeseries <- df_timeseries[,c("time","usertype")]

df_subscriber <- df_timeseries %>% subset(usertype == 'Subscriber')
df_customer <- df_timeseries %>% subset(usertype == 'Customer')
df_subscriber <- df_subscriber %>% group_by(time) %>% summarise(Number = n())
df_customer <- df_customer %>% group_by(time) %>% summarise(Number = n())

ts_subscriber <- xts::xts(df_subscriber$Number,as.Date(as.yearmon(df_subscriber$time)))
ts_customer <- xts::xts(df_customer$Number,as.Date(as.yearmon(df_customer$time)))
ts <- cbind(ts_subscriber,ts_customer)
dygraph(ts,main = 'Impact of season',ylab = "Frequency") %>% dySeries('..1',label = 'subscriber ') %>% dySeries('..2',label = 'customer') %>% dyLegend(show = "always", hideOnMouseOut = FALSE,width=400) %>%
  dyOptions(colors = RColorBrewer::brewer.pal(3, "Set2"))

5. Analysis about customer flow

library(lubridate) 
df_timeslot <- mutate(df,hour = hour(as.POSIXct(starttime)))
df_timeslot <- mutate(df_timeslot,wday = wday(as.POSIXct(starttime)))
df_timeslot <- df_timeslot[,c("hour","wday","minute","start.station.name","usertype")]
timeslot <- ggplot(data=df_timeslot,mapping=aes(as.factor(hour))) +geom_bar(color='Orange')+ggtitle('Time slot in trip data group by usertype')
timeslot + facet_wrap(~usertype)

From the above analysis, the overall time slot has two peaks: in 8-9 am and 5-6 pm, which is reasonable since they are the rush hour, which has a large number of passenger flow.
However, the time slot distribution differs a lot with aspect to different usertype and different weekdays.
Take the above images as examples, the customer has a higher peak in the time slot which is around 4 pm while the subscribers has two peaks.

timeslot <- ggplot(data=df_timeslot,mapping=aes(as.factor(hour)))  +geom_bar(color='Orange')+ggtitle('Time slot in trip data group by weekdays')
timeslot + facet_wrap(~as.factor(wday))

Also, the weekdays play a big role in the time duration distribution. On weekdays, we still sees similar two peaks. However, on weekends, we can only observe one peak. Thus, we can think some specific pricing strategy for a specific time slot in the weekdays.

Executive summary (Presentation-style)

Provide a short nontechnical summary of the most revealing findings of your analysis written for a nontechnical audience. The length should be approximately two pages (if we were using pages…) Take extra care to clean up your graphs, ensuring that best practices for presentation are followed. * time slot * weather & season

Interactive component

Our data interactive component is written with Shiny and published out at shinyapps.io. Here is the Link

Our interactive part mainly focuses on the geo-based data analysis. In other words, given specific start (and end ) stations of citibikes, we aim to find some bike using patterns behind and try to understand and explain them in terms of users’ perspectives. In addition, we also provide some business advice based on the interactive analysis results.

Because there are a lot of stations in New York (785 in the datasets we use in the interactive part), the popularity of different stations may vary a lot from one to anther. For decision makers, it would be necessary to know the users demands and prefences to different bike stations and thus making more resonable station construction and bike allocation plans. For users, it might be useful to know the general aeverage riding time from stations A to station B. Also, given the start station, we hope to find the top 5~30 frequent end stations users tend to return the bikes to from the specific start station, which would also help the decision makers to do some research on the different types ouser habits. I’ll offer you some basic informations about this shiny app and show you how to use the interactive app to to data analysis as follows.

The dataset of the interactive part is from July 2018. We mark all start staions of the dataset on the map, which will exhibit different clusters based on the zoom scales. Users could see a specific station on the map by zooming in or clicking the cluster to yield smaller clusters and individual stations. The staion’s name will show up when mouseovering a specific individual station. Users could click the station marker to see the bar plot named “Time slot in the data”, which displays average/total bike usage within different time in a day.

Moreover, by selecting or searching a specific “Start Stations” in the select box in the bottom of the web page, users could see the corresponding bar plot of top 10 “End Stations”. It’s also convient to change the range of top popular “End Stations” by dragging slider bar to see different

Conclusion

After doing the analysis of Citi Bike data, we are surprised that there are total 1.95 million ridings in June 2018. It is glad to see more people choose this healthy mode of transport. We have found some results: The active users in June 2018 is between 24 and 37 years old; More than three fifth of users are female; In June 2018, Popular stations are in Midtown in manhattan. (other results…)

limitation:

future directions: